Using Bilingual Materials to Develop Word Sense Disambiguation Methods

Authors

  • William A. Gale
  • Kenneth W. Church
  • David Yarowsky
Abstract

Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Much of this work has been stymied by difficulties in acquiring appropriate lexical resources, such as semantic networks and annotated corpora. Following the suggestion in Brown et al. (1991a) and Dagan et al. (1991), we have achieved considerable progress recently by taking advantage of a new source of testing and training materials. Rather than depending on small amounts of hand-labeled text, we have been making use of relatively large amounts of parallel text, text such as the Canadian Hansards (parliamentary debates), which are available in two (or more) languages. The translation can often be used in lieu of hand-labeling. For example, consider the polysemous word sentence, which has two major senses: (1) a judicial sentence, and (2) a syntactic sentence. We can collect a number of sense (1) examples by extracting instances that are translated as peine, and we can collect a number of sense (2) examples by extracting instances that are translated as phrase. In this way, we have been able to acquire a considerable amount of testing and training material for developing and testing our disambiguation algorithms.

The availability of this testing and training material has enabled us to develop quantitative disambiguation methods that achieve 90% accuracy in discriminating between two very distinct senses of a noun such as sentence. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then in the testing phase, we are given a new instance of the noun, and are asked to assign the instance to one of the senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances, using a Bayesian argument that has been applied successfully in related applications such as author identification and information retrieval.
The final section of the paper describes a number of methodological studies which show that the training set need not be large and that it need not be free from errors. Perhaps most surprisingly, we find that the context should extend ±50 words, an order of magnitude larger than one typically finds in the literature.

1. Word-Sense Disambiguation

Consider, for example, the word duty, which has at least two quite distinct senses: (1) a tax and (2) an obligation. Three examples of each sense are given in Table 1 below. The classic disambiguation problem is to construct a means for discriminating between two or more sets of examples such as those shown in Table 1. This paper focuses on the methodology required to address the classic problem, and has less to say about the details required for practical application of this methodology. Consequently, the reader should exercise some caution in interpreting the 90% figure reported here; this figure could easily be swamped in a practical system by any number of factors that go beyond the scope of this paper. In particular, the Canadian Hansards, one of the few currently available sources of parallel text, is extremely unbalanced, and is therefore severely limited as a basis for a practical disambiguation system.

Table 1: Sample Concordances of duty (split into two senses)

Sense        Examples (from the Canadian Hansards)
tax          fewer cases of companies paying duty and then claiming a refund
             and impose a countervailing duty of 29,1 per cent on candian exports
             of the united states imposed a duty on canadian saltfish last year
obligation   it is my honour and duty to present a petition duly approved
             working well beyond the call of duty ? SENT i know what time they start
             in addition , it is my duty to present the government 's comments

Moreover, it is important to distinguish the monolingual word-sense disambiguation problem from the translation issue. It is not always necessary to resolve the word-sense ambiguity in order to translate a polysemous word. Especially in related languages like English and French, it is common for word-sense ambiguity to be preserved in both languages. For example, both the English noun interest and the French equivalent intérêt are multiply ambiguous in more or less the same ways. Thus, one cannot turn to the French to resolve the ambiguity in the English, since the word is equally ambiguous in both languages. Furthermore, when one word does translate to two (e.g., sentence → peine and phrase), the choice of target translation need not indicate a sense split in the source. Consider, for example, the group of Japanese words translated by "wearing clothes" in English. While the Japanese have five different words for "wear" depending on which part of the body is involved, we doubt that English speakers would ever sort "wearing shoes" and "wearing a shirt" into separate categories. These examples indicate that word-sense disambiguation and translation are somewhat different problems. It would have been nice if the translation could always be used in lieu of hand-tagging to resolve the word-sense ambiguity, but unfortunately this is not the case. Nevertheless, the translation is often helpful for resolving the ambiguity. It seems to us to make sense to continue to use the Hansard translations to develop the discrimination methodology, while we continue to seek more appropriate sources of testing and training materials. See Yarowsky (1992) for an application of the methods developed here to a somewhat more appropriate source, a combination of Roget's Thesaurus (Chapman, 1977), which should not be confused with the much smaller and less up-to-date 1911 edition of Roget's, and Grolier's Encyclopedia (1991).

2. Knowledge Acquisition Bottleneck

In our view, the crux of the problem in developing methods for word sense disambiguation is to find a strategy for acquiring a sufficiently large set of training material. We think that we have found such a strategy by turning to parallel text as a source of testing and training materials. Most of the previous work falls into one of three camps: (1) qualitative methods, e.g., Hirst (1987), (2) dictionary-based methods, e.g., Lesk (1986), and (3) hand-annotated corpora, e.g., Kelly and Stone (1975). In each case, the work has been limited by the knowledge acquisition bottleneck.

2.1 Qualitative Methods

For example, there has been a tradition in parts of the AI community of building large experts by hand, e.g., Granger (1977), Rieger (1977), Small and Rieger (1982), Hirst (1987). Unfortunately, this approach is not very easy to scale up, as many researchers have observed: "The expert for THROW is currently six pages long, ... but it should be 10 times that size" (Small and Rieger, 1982). Since this approach is so difficult to scale up, much of the work has had to focus on "toy" domains (e.g., Winograd's Blocks World) or sublanguages (e.g., Isabelle (1984), Hirschman (1986)). Currently, it is not possible to find a semantic network with the kind of broad coverage that would be required for unrestricted text.

From an AI point of view, it appears that the word-sense disambiguation problem is "AI-complete," meaning that you can't solve this problem until you've solved all of the other hard problems in AI. Since this is unlikely to happen any time soon (if at all), it would seem to suggest that word-sense disambiguation is just too hard a problem, and that we should spend our time working on a simpler problem where we have a good chance of making progress.
Rather than accept this rather pessimistic conclusion, we prefer to reject the premise and search for an alternative point of view.

2.2 Machine-Readable Dictionaries (MRDs)

Others, such as Lesk (1986), Walker (1987), and Ide and Veronis (1990), have turned to machine-readable dictionaries (MRDs) such as Oxford's Advanced Learner's Dictionary of Current English (OALDCE) in the hope that MRDs might provide a way out of the knowledge acquisition bottleneck. These researchers seek to develop a program that could read an arbitrary text and tag each word in the text with a pointer to a particular sense number in a particular dictionary. Unfortunately, the approach doesn't seem to work as well as one might hope. Lesk (1986) reports accuracies of 50-70% on short samples of Pride and Prejudice. Part of the problem may be that dictionary definitions are too short to mention all of the collocations (words that are often found in the context of a particular sense of a polysemous word). In addition, dictionaries have much less coverage than one might have expected. Walker (1987) reports that perhaps half of the words occurring in a new text cannot be related to a dictionary entry. Thus, like the AI approach, the dictionary-based approach is also limited by the knowledge acquisition bottleneck; dictionaries simply don't record enough of the relevant information, and much of the information that is stored in the dictionary is not in a format that computers can easily digest, at least at present.

2.3 Approaches Based on Hand-Annotated Corpora

A third line of research makes use of hand-annotated corpora. Most of these studies are limited by the availability of hand-annotated text. Since it is unlikely that such text will be available in large quantities for most of the polysemous words in the vocabulary, there are serious questions about how such an approach could be scaled up to handle unrestricted text.
Nevertheless, we are extremely sympathetic with the basic approach, and will adopt a very similar strategy ourselves. However, we will introduce one important difference: the use of parallel text in lieu of hand-annotated text, as suggested by Brown et al. (1991a), Dagan et al. (1991), and others.

Kelly and Stone (1975) constructed 1815 disambiguation models by hand, selecting words with a frequency of at least 20 in a half-million word corpus. Most subsequent work has sought automatic methods, because it is quite labor intensive to construct these rules by hand. Weiss (1973) first built rule sets by hand for five words, then developed automatic procedures for building similar rule sets, which he applied to three additional words. Unfortunately, the system was tested on the training set, so it is difficult to know how well it actually worked.

Black (1987, 1988) studied five 4-way polysemous words using about 2000 hand-tagged concordance lines for each word. Using 1500 training examples for each word, his program constructed decision trees based on the presence or absence of 81 "contextual categories" within the context of the ambiguous word. (The context was defined to be the concordance line, which we estimate to be about ±6 words from the ambiguous word, given that his 2000 concordance lines contained about 26,000 words.) He used three different types of contextual categories: (1) subject categories from LDOCE, the Longman Dictionary of Contemporary English (Longman, 1978), (2) the 41 vocabulary items occurring most frequently within two words of the ambiguous word, and (3) the 40 vocabulary items, excluding function words, occurring most frequently in the concordance line. Black found that the dictionary categories produced the weakest performance (47 percent correct), while the other two were quite close at 72 and 75 percent correct, respectively.

There has recently been a flurry of interest in approaches based on hand-annotated corpora.
Hearst (1991) is a very recent example of an approach somewhat like Black (1987, 1988), Weiss (1973), and Kelly and Stone (1975) in this respect, though she makes use of considerably more syntactic information than the others. Her performance also seems to be somewhat better than the others', though it is difficult to compare performance across systems.

3. An Information Retrieval (IR) Approach to Sense Disambiguation

We have been experimenting with an information retrieval approach to sense disambiguation. In the training phase, we collect a number of instances of sentence that are translated as peine, and a number of instances of sentence that are translated as phrase. Then in the testing phase, we are given a new instance of sentence, and are asked to assign the instance to one of the two senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances. Basically, we are treating contexts as analogous to documents in an information retrieval setting. Just as the probabilistic retrieval model (van Rijsbergen, 1979, chapter 6; Salton, 1989, section 10.3) sorts documents d by

    score(d) = Π_{token in d} Pr(token | rel) / Pr(token | irrel)

we will sort contexts c by

    score(c) = Π_{token in c} Pr(token | sense1) / Pr(token | sense2)

where Pr(token | sense) is an estimate of the probability that token appears in the context of sense1 or sense2. Contexts are defined to extend 50 words to the left and 50 words to the right of the polysemous word in question, for reasons that will be discussed in section 5. This model ignores a number of important linguistic factors such as word order and collocations (correlations among words in the context). Nevertheless, there are 2V ≈ 200,000 parameters in the model. It is a non-trivial task to estimate such a large number of parameters, especially given the sparseness of the training data.
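As a concrete illustration, the scoring rule above can be sketched in Python. This is a log-space version of the product (sums of log ratios are numerically safer than long products), with hypothetical smoothed probability tables; it is not the authors' implementation.

```python
from math import log

def score(context_tokens, pr_sense1, pr_sense2):
    """Log-space version of score(c): sum over tokens in the +/-50-word
    context of log(Pr(token | sense1) / Pr(token | sense2)).

    pr_sense1 and pr_sense2 are hypothetical dicts mapping each token to
    a smoothed, strictly positive probability under each sense.
    """
    return sum(log(pr_sense1[t]) - log(pr_sense2[t]) for t in context_tokens)

# Toy probability tables: "court" is typical of sense 1 (judicial),
# "word" of sense 2 (syntactic), "the" is uninformative.
pr1 = {"court": 0.010, "the": 0.050, "word": 0.001}
pr2 = {"court": 0.001, "the": 0.050, "word": 0.010}
print(score(["court", "the"], pr1, pr2) > 0)   # prints True: favours sense 1
```

A positive score favours sense 1 and a negative score favours sense 2; informative tokens contribute large log ratios while neutral tokens contribute roughly zero.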
The training material typically consists of approximately 12,000 words of text (100 words of context for 60 instances of each of two senses). Thus, there are more than 15 parameters to be estimated for each data point. Clearly, we need to be fairly careful, given that we have so many parameters and so little evidence.

3.1 Using Global Probabilities to Smooth the Local Probabilities

In principle, the conditional probabilities, Pr(token | sense), can be estimated by selecting those parts of the entire corpus which satisfy the required conditions (e.g., 100-word contexts surrounding instances of one sense of duty), counting the frequency of each word, and dividing the counts by the total number of words satisfying the conditions. However, this estimate, which is known as the maximum likelihood estimate (MLE), has a number of well-known problems. In particular, it will assign zero probability to words that do not happen to appear in the sample. Zero is not only a biased estimate of their true probability, but it is also unusable for the sense disambiguation task.

In order to avoid these problems, we have decided to use information from the entire corpus in addition to information from the conditional sample. We will estimate Pr(token | sense) by interpolating between local probabilities, computed within the 100-word contexts, and global probabilities, Pr(token), computed over the entire corpus. The local probabilities are more relevant, and the global probabilities are better measured. We seek a trade-off between random measurement errors and bias errors. This is accomplished by estimating the relevance of the larger corpus to the conditional sample in order to find the optimal trade-off between random error and bias. See Gale et al. (to appear) for further details.
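The interpolation idea can be sketched as follows. This is a minimal sketch: the relevance weight lam is left as a free parameter here, whereas Gale et al. estimate it from the data.

```python
def interpolated_prob(count_local, total_local, prob_global, lam):
    """Blend the local MLE (count within the conditioned sample) with the
    global corpus probability. lam in [0, 1] is the assumed relevance of
    the local sample; lam = 1 recovers the MLE, lam = 0 the global model.
    """
    mle = count_local / total_local
    return lam * mle + (1.0 - lam) * prob_global

# A word unseen in the 6000-word conditioned sample still gets a
# nonzero probability, unlike the raw MLE.
print(interpolated_prob(0, 6000, 1e-4, 0.8) > 0)   # prints True
```

The key property is that the estimate never reaches zero for any word attested in the global corpus, which keeps the log-ratio scoring well defined for unseen context words.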
3.2 An Example

Table 2 (below) gives a sense of what the interpolation procedure does for some of the words that play an important role in disambiguating between the two senses of duty in the Canadian Hansards. Table 2 lists the 15 words with the largest product (shown in the first column) of the model score (the second column) and the frequency in the 6000-word training corpus (the third column). The conditioned samples are obtained by extracting a 100-word window surrounding each of the 60 training examples. The training sets were selected by randomly sampling instances of duty in the Hansards until 60 instances were found that were translated as droit and 60 instances were found that were translated as devoir. The first set of 60 is used to construct the model for the tax sense of duty, and the second set of 60 is used to construct the model for the obligation sense of duty.

The column labeled "freq" shows the number of times that each word appeared in the conditioned sample. For example, the count of 50 for the word countervailing indicates that countervailing appeared 50 times within the conditioned sample. This is a remarkable fact, given that countervailing is a fairly unusual word. It is much less surprising to find a common word like to appearing quite often (228 times) in the other conditioned sample. The second column (labeled "weight") models the fact that 50 instances of countervailing are more surprising than 228 instances of to. The weight for a word is the log of its likelihood in the conditioned sample compared with its likelihood in the global corpus. The first column, the product of these log likelihoods and the frequencies, is a measure of the importance, within the training set, of the word for determining which sense the training examples belong to. Note that words with large scores do seem, intuitively, to distinguish the two senses, at least in the Canadian Hansards.
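In sketch form, the weight and weight*freq columns of Table 2 can be computed as a log likelihood ratio of local versus global relative frequency. The counts below are hypothetical, and the paper's actual estimator also smooths these probabilities before taking logs.

```python
from math import log

def weight_and_importance(freq_local, total_local, freq_global, total_global):
    """weight      = log of the word's likelihood in the conditioned sample
                     relative to its likelihood in the global corpus;
       importance  = weight * local frequency, the ranking key of Table 2."""
    p_local = freq_local / total_local
    p_global = freq_global / total_global
    weight = log(p_local / p_global)
    return weight, weight * freq_local

# A rare word concentrated in the conditioned sample (hypothetical counts)
# outranks a common word with the same local count.
w_rare, imp_rare = weight_and_importance(50, 6000, 100, 1_000_000)
w_common, imp_common = weight_and_importance(50, 6000, 50_000, 1_000_000)
print(imp_rare > imp_common)   # prints True
```

This reproduces the intuition in the text: 50 occurrences of a globally rare word like countervailing carry far more weight than a comparable count of a function word like to.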
There are obviously some biases introduced by the unusual nature of this corpus, which is hardly a balanced sample of general language. For example, the set of words listed in Table 2 under the obligation sense of duty is heavily influenced by the fact that the Hansards contain a fair amount of boilerplate of the form: "Mr. Speaker, pursuant to standing order..., I have the honour and duty to present petitions duly signed by... of my electors...."

Table 2: Selected Portions of the Two Models for the Two Senses of duty

tax sense of duty                           obligation sense of duty
weight*freq  weight  freq  word             weight*freq  weight  freq  word
285          5.7     50    countervailing   64           3.2     20    petitions
111.8        4.3     26    duties           59.28        0.26    228   to
99.9         2.7     37    u.s              56.28        0.42    134   (missing in source)
73.1         1.7     43    trade            51           3       17    petition
70.2         1.8     39    states           47.6         2.8     17    pursuant
69.3         3.3     21    duty             46.28        0.52    89    mr
68.4         3.6     19    softwood         37.8         2.7     14    honour
68.4         1.9     36    united           37.8         1.4     27    order
58.8         8.4     7     rescinds         36           2       18    present
54           3       18    lumber           33.6         2.8     12    proceedings
50.4         4.2     12    shingles         31.5         3.5     9     prescription
50.4         4.2     12    shakes           31.32        0.87    36    house
46.8         3.6     13    35               29.7         3.3     9     reject
46.2         2.1     22    against          29.4         4.2     7     boundaries
41.8         1.1     38    canadian         28.7         4.1     7     electoral
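The construction of the droit/devoir training sets described in section 3.2, where the translation stands in for a hand label, can be sketched as follows. The aligned-region data structures and names here are hypothetical; the paper's actual pipeline works from aligned Hansard sentences.

```python
def collect_examples(aligned_pairs, word, translations):
    """Given (english_region, french_region) token-list pairs, label each
    region containing `word` by which candidate translation appears in the
    aligned French region, e.g. translations = {"droit": "tax", ...}.
    Regions matching zero or several candidates are discarded.
    """
    examples = {}
    for eng, fra in aligned_pairs:
        if word not in eng:
            continue
        hits = [sense for t, sense in translations.items() if t in fra]
        if len(hits) == 1:            # keep only unambiguous regions
            examples.setdefault(hits[0], []).append(eng)
    return examples

# Toy aligned regions (hypothetical, heavily simplified).
pairs = [(["pay", "the", "duty"], ["payer", "le", "droit"]),
         (["my", "duty", "to", "present"], ["mon", "devoir", "de", "presenter"])]
res = collect_examples(pairs, "duty", {"droit": "tax", "devoir": "obligation"})
print(sorted(res))   # prints ['obligation', 'tax']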



Publication date: 1992